Sepsis Prediction, Code Example

Author: Katrina Adams
Data: Publicly available from the 2019 PhysioNet/CinC sepsis prediction challenge
Objective: Predict sepsis 4 hours ahead from a time series of measurements

I chose this use case and began a notebook in preparation for a 3-hour take-home data science exercise as part of the interview process for a healthcare ML company. I started this work before receiving the take-home and it does not contain any data or information from that interview process.

This notebook is an extension of that work to show how I typically organize my code and the rough process that I might go through for an initial experiment on a new data set and use case. I'll show a very light EDA and look at two sets of features with a couple of models and how they perform on a single prediction task. This notebook is not meant to be exhaustive (with notes on everything that I might consider or try for a similar problem), and as we'll see, the models aren't great at the prediction task, but that's OK because this notebook is primarily intended to be a code sample.

Data Dictionary

HR - heart rate
O2Sat - "Oxygen saturation measures how much of the hemoglobin in the red blood cells is carrying oxygen" [1], measured by pulse oximeter [2]
Temp - temperature
SBP - systolic blood pressure
MAP - mean arterial pressure
DBP - diastolic blood pressure
Resp - respiration rate
BaseExcess - excess of base present in the blood (negative value is a deficit) (mEq/L)
HCO3 - bicarbonate (mEq/L)
FiO2 - fraction of inspired oxygen. I think this might relate to O2 flow-rate provided to the patient, not a physiological measurement of the patient
pH - blood pH
PaCO2 - Partial pressure of carbon dioxide. "This measures the pressure of carbon dioxide dissolved in the blood and how well carbon dioxide is able to move out of the body." [1]
SaO2 - oxygen saturation, measured by blood analysis (e.g. a blood gas) [2]
AST - (aspartate aminotransferase) liver function test [3]
BUN - blood urea nitrogen, "A BUN test can reveal whether your urea nitrogen levels are higher than normal, suggesting that your kidneys or liver may not be working properly" [4]
Alkalinephos - Alkaline Phosphatase
Calcium - calcium in blood
Chloride - chloride in blood (electrolyte)
Creatinine - creatinine in blood, indicator of kidney functioning
Bilirubin_direct - "bilirubin once it reaches the liver and undergoes a chemical change" [8]
Glucose - blood glucose
Lactate - lactic acid in blood. Elevated lactate can be linked to conditions that prevent someone from getting enough oxygen (eg. sepsis) or metabolic causes that would lead someone to require more oxygen than normal (eg. uncontrolled diabetes) [5]
Magnesium - magnesium in blood, can be high or low indicating different potential problems [6]
Phosphate - phosphate in blood, too high or low can indicate kidney disease [7]
Potassium - potassium in blood (electrolyte)
Bilirubin_total - bilirubin relates to liver functioning
TroponinI - "Troponin tests measure the level of cardiac-specific troponin in the blood to help detect heart injury." [9]
Hct - "hematocrit test measures the proportion of red blood cells in your blood" [10]
Hgb - hemoglobin count
PTT - "partial thromboplastin time test measures the time it takes for a blood clot to form" [11]
WBC - white blood cell count
Fibrinogen - "Fibrinogen is a protein produced by the liver. This protein helps stop bleeding by helping blood clots to form." [12]
Platelets - platelet count, high or low can suggest different causes
Age - patient age
Gender - patient gender
Unit1 - not sure, maybe hospital unit where patient is treated
Unit2 - not sure, maybe hospital unit where patient is treated
HospAdmTime - not sure, maybe hours before patient arrives in ICU
ICULOS - ICU length of stay (hours)
SepsisLabel - It would be important to ask domain experts about the sepsis label and what exactly it means (diagnosis, suspicion, criteria for label, consistency across providers, noisiness of label, etc.)
pname - patient ID
hour - the count of rows for the patient, since each row represents one patient-hour of data

Globals

Target Label

Our target for prediction is sepsis in 4 hours' time

Train/Dev/Test Sets

For this exercise, I'll just use training and development sets. Normally, a test set could be included to get an estimate for how a model might generalize to unseen data, but I'll skip that for now. Typically, the question that one is trying to answer and the data that they have available will dictate how the data should be split. In this case, I'm assuming that we are interested in prediction within the population that this data is from and will include Unit1 and Unit2 in both the train and dev sets (and test set if we were using one).

I'll split randomly by patient to prevent leakage from training to dev.
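A patient-level split like this can be sketched with scikit-learn's `GroupShuffleSplit`, grouping on the patient ID column (`pname` in this data); the toy DataFrame below is illustrative:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def split_by_patient(df, group_col="pname", dev_frac=0.2, seed=0):
    """Split rows into train/dev so that no patient appears in both sets."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=dev_frac, random_state=seed)
    train_idx, dev_idx = next(splitter.split(df, groups=df[group_col]))
    return df.iloc[train_idx], df.iloc[dev_idx]

# Tiny illustrative example: 4 patients, 2 rows each
df = pd.DataFrame({"pname": ["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4"],
                   "HR": [80, 82, 70, 71, 90, 95, 60, 61]})
train, dev = split_by_patient(df)
```

Splitting on groups (rather than rows) is what prevents hours from the same patient leaking across the train/dev boundary.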

Sepsis Label

Class Imbalance: There are more negative than positive cases by about 13:1, and not very many positive cases to learn from in absolute terms. Options for handling class imbalance include class weighting in the loss, over- or under-sampling (e.g. SMOTE), adjusting the decision threshold, and evaluating with metrics that are informative under imbalance (e.g. AUCPR).
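One common option, class weighting, can be sketched in a few lines; the data below is synthetic, built to mirror the roughly 13:1 negative:positive ratio:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data with ~13:1 negative:positive imbalance
rng = np.random.default_rng(0)
X = rng.normal(size=(280, 3))
y = np.r_[np.ones(20), np.zeros(260)].astype(int)

# class_weight="balanced" reweights the loss inversely to class frequency,
# an alternative to resampling that leaves the data itself untouched.
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```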

Look at the data

Let's take a look at Facets for this data as a way to get a quick sense of what's in it and start to think about cleaning, features, and modeling approaches and expectations. I'll do some light cleaning just in the notebook here. Normally, I would have separate notebooks for different aspects of EDA and then refactor any cleaning/transformations/feature engineering into separate modules that I could run on future data.

Train and Dev Sets

Sepsis Label

Some quick general observations: There's a lot of overlap between the sepsis and not-sepsis states in the features. Predicting sepsis is well known to be a difficult problem, and we can see that there won't be easy separability with these features as-is.

EtCO2 is missing for all rows, so we'll drop this column

HospAdmTime: I'm not sure what this is recording. Since it's mostly negative, I'm guessing it's the time when the patient was admitted to the hospital before going to the ICU, but would have to ask.

Let's take a look at Unit1 and Unit2

Since Unit1 and Unit2 are missing for almost 40% of patients, we'll drop these for this example. There are other ways to deal with these missing values if we want to use the Unit information.
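A hypothetical helper for this cleaning step might look like the following (the threshold and toy DataFrame are illustrative; `EtCO2` is all-missing as in the real data):

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing_frac=0.95, always_drop=("Unit1", "Unit2")):
    """Drop columns whose missing fraction exceeds a threshold, plus any
    explicitly named columns that happen to be present."""
    missing_frac = df.isna().mean()
    to_drop = set(missing_frac[missing_frac > max_missing_frac].index) | set(always_drop)
    return df.drop(columns=[c for c in to_drop if c in df.columns])

df = pd.DataFrame({"HR": [80, 82, 75],
                   "EtCO2": [np.nan, np.nan, np.nan],  # missing for all rows
                   "Unit1": [1, np.nan, np.nan]})
cleaned = drop_sparse_columns(df)
```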

Aggregate by patient
Look at distributions of features within and among patients

Group by patient and look at Facets to see which features might be missing entirely for some patients and what the distributions look like across patients

Pair-wise Correlations

Some expected correlations here (e.g. kidney indicators, pH/bicarbonate, bilirubin, BPs). Could probably do some dimensionality reduction/embedding and maintain a lot of the variability.
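One way to surface these pairs programmatically is to scan the upper triangle of the correlation matrix; the data below is synthetic, with an SBP/MAP correlation built in for illustration:

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, threshold=0.8):
    """Return feature pairs with |Pearson r| above the threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle (k=1 excludes the diagonal)
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()  # drops the NaNs in the masked lower triangle
    return pairs[pairs > threshold].sort_values(ascending=False)

rng = np.random.default_rng(0)
sbp = rng.normal(120, 15, 200)
df = pd.DataFrame({"SBP": sbp,
                   "MAP": 0.7 * sbp + rng.normal(0, 2, 200),  # correlated by construction
                   "WBC": rng.normal(8, 2, 200)})
pairs = top_correlated_pairs(df)
```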

Filtering Data

Since we're interested in predicting patients that will have sepsis in 4 hours' time we will do some filtering of the data set as follows:

Note: Filtering these rows is losing a lot of information that might be helpful for understanding the sepsis state, and could potentially help with feature engineering or other modeling approaches. However, for this example and in the interest of getting something end-to-end quickly, I'll keep things simple for now.
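A simple sketch of constructing the 4-hour-ahead target (the notebook's actual filtering may differ; the tiny DataFrame is illustrative): shift `SepsisLabel` back by 4 hours within each patient, then drop rows with no label.

```python
import pandas as pd

HORIZON = 4  # predict sepsis 4 hours ahead

def make_target(df, horizon=HORIZON):
    """Label each row with the patient's sepsis state `horizon` hours later.
    Rows in the last `horizon` hours of a stay have no label and are dropped."""
    df = df.sort_values(["pname", "hour"]).copy()
    df["target"] = df.groupby("pname")["SepsisLabel"].shift(-horizon)
    return df.dropna(subset=["target"]).astype({"target": int})

df = pd.DataFrame({"pname": ["p1"] * 8,
                   "hour": range(8),
                   "SepsisLabel": [0, 0, 0, 0, 0, 0, 1, 1]})
labeled = make_target(df)
```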

Missing Data

Missing Completely at Random (MCAR) - simplest case for imputation; can impute from the population distribution
Missing at Random (MAR) - can impute from observed variables
Missing Not at Random (MNAR) - the missing value comes from a different distribution than the observed values (missingness is meaningful)

Clinical data is well known to be MNAR, but we will use some simple imputation here for expediency (which assumes MCAR), and we will include a missingness indicator.

Imputation

There are many approaches for imputation of missing data in healthcare. Some of these include:

Imputation plan for this example

  1. Carry-forward for patients that have a value within their history
  2. Missingness indicator for values that are missing after carry forward
    • General notes on the missing indicator: A benefit of the missing indicator is that the presence or absence of a test does give us information about a patient's condition and care. It also allows the model to know whether a value was imputed. (There's also interesting research around indicators and fairness.) A disadvantage of the indicator is that it encodes clinician decisions rather than patient physiology. We wouldn't want an alarm system to tell a clinician that sepsis is more likely whenever they order a test that means they are already suspecting sepsis. The indicator also increases the dimensionality of the feature space.
    • Notes on putting the indicator here: There's some subtlety around when we add the missingness indicator for these ICU measurements, and it will vary with the value being imputed. For example, some values are unlikely to change hour-to-hour, so a value from a previous hour would be very likely to represent the current hour. However, carrying something forward 12 hours may or may not be realistic, especially given the dynamic nature of many patients in the ICU. There are many different ways we could approach giving a model information about the reliability of a measurement at a certain time, but we'll keep it simple for this example.
  3. Simple imputation within patient histories
  4. Simple imputation across patients
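A pandas sketch of this plan (column names are illustrative, and the "simple imputation" in steps 3 and 4 is shown as a median; it assumes rows are already sorted by patient and hour):

```python
import numpy as np
import pandas as pd

def impute_with_indicator(df, cols, group_col="pname"):
    """1. carry forward within patient; 2. flag what is still missing;
    3. fill with the within-patient median; 4. fall back to the population median."""
    df = df.copy()
    for col in cols:
        df[col] = df.groupby(group_col)[col].ffill()                        # 1.
        df[f"{col}_missing"] = df[col].isna().astype(int)                   # 2.
        df[col] = df[col].fillna(
            df.groupby(group_col)[col].transform("median"))                 # 3.
        df[col] = df[col].fillna(df[col].median())                          # 4.
    return df

df = pd.DataFrame({"pname": ["p1", "p1", "p1", "p2", "p2"],
                   "HR": [80.0, np.nan, 84.0, np.nan, np.nan]})
imputed = impute_with_indicator(df, ["HR"])
```

Note that patient p2, who has no HR at all, gets the population median and a raised indicator, while p1's single gap is filled by carry-forward with no indicator.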

Time Series Plots

Classification

I'm showing just two classifiers here: Logistic Regression and XGBoost. I've implemented a class to wrap these classifiers to demonstrate the method (see Helpers.py). We could also use this API for a custom classifier. An interesting classifier in this case could be the sepsis clinical criteria for comparison to the ML methods. Logistic Regression is used here as a baseline model with XGB as a comparison with more capacity.
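Helpers.py isn't reproduced here, but a minimal sketch of the kind of shared interface such a wrapper might expose (hypothetical class and names, not the actual implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class ClassifierWrapper:
    """Hypothetical thin wrapper giving different classifiers (sklearn,
    XGBoost, or a rule-based clinical-criteria baseline) one shared
    fit/predict_proba interface."""

    def __init__(self, model, name):
        self.model = model
        self.name = name

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

    def predict_proba(self, X):
        # Return P(positive class) as a 1-D array
        return self.model.predict_proba(X)[:, 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (rng.random(100) < 0.3).astype(int)
wrapped = ClassifierWrapper(LogisticRegression(), "logreg").fit(X, y)
scores = wrapped.predict_proba(X)
```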

Objective Function

In the absence of discussions with stakeholders, we'll assume equal costs for FP and FN and evaluate model performance with AUCPR and F1-score. This is not a realistic assumption; in reality, the objective function and operating point would depend on these different costs and on how the model might impact care decisions. The unit of measurement will be the patient-hour (e.g. evaluating false alarms based on the number of times the model is run, not per patient).
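With scikit-learn, both metrics over patient-hours look like the following (toy labels and scores, where each entry is one model call for one hour):

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score

y_true = np.array([0, 0, 1, 0, 1, 0, 0, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.8, 0.3, 0.35, 0.2, 0.15, 0.6, 0.7, 0.1])

aucpr = average_precision_score(y_true, y_score)      # AUCPR (average precision)
f1 = f1_score(y_true, (y_score >= 0.5).astype(int))   # F1 at a 0.5 threshold
```

Note that AUCPR is threshold-free while F1 requires choosing an operating point, which is exactly where the FP/FN cost discussion would come in.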

Predict using single hour

Let's look at just using the data for the current hour to predict whether the patient will have a sepsis in 4 hours. This loses a lot of information about the patient's history and trajectory but might be an interesting baseline to start with.

Make X and y using all patient data columns except those related to the hospital stay (Units and time columns)

Train logistic regression and XGBoost classifiers and look at results
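A sketch of this training-and-evaluation loop on synthetic data (scikit-learn's `GradientBoostingClassifier` stands in for XGBoost here; both expose the same fit/predict_proba interface):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import average_precision_score

# Synthetic stand-in data with a weak signal and rare-ish positives
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 6))
y = (X[:, 0] + rng.normal(0, 2, 600) > 2.2).astype(int)
X_tr, y_tr, X_dev, y_dev = X[:450], y[:450], X[450:], y[450:]

results = {}
for name, model in [("logreg", LogisticRegression(class_weight="balanced", max_iter=1000)),
                    ("gbt", GradientBoostingClassifier(random_state=0))]:
    model.fit(X_tr, y_tr)
    results[name] = {
        "train_aucpr": average_precision_score(y_tr, model.predict_proba(X_tr)[:, 1]),
        "dev_aucpr": average_precision_score(y_dev, model.predict_proba(X_dev)[:, 1]),
    }
```

Comparing train and dev AUCPR per model is what surfaces the bias/variance pattern discussed below.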

So many false positives! This is common with predictions of rare events and medical alert ML systems. There's certainly a large gap between the performance of these models and where we'd like to be (high bias), and a significant gap between dev and train for XGBoost (overfitting and high variance for XGB). Overfitting is a common problem for boosted decision trees. Typical approaches to reduce bias include increasing the flexibility of the model and improving model training (e.g. optimization). Typical approaches to reduce variance include getting more data and increasing regularization (or trying different models). I'm skipping hyperparameter tuning for now, but in general this process can improve model performance.

Feature Engineering

Let's add some patient history to the features to see if changes over time help make a prediction. The Insight sepsis prediction [13] uses lagged values and deltas of vitals, so that might be a good place to start. From looking at the time series plots for sepsis and non-sepsis patients, it looks like the vitals trends might be useful features. Therefore, I'll include vitals at 2-, 4-, and 6-hour lags as well as the deltas between those lags and the most recent value.

These deltas are quite noisy (especially with simple imputation); a smoother measure of trend might perform better

Remove rows from too early in the patient history to have lagged features (first 6 hours)
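The lag/delta construction and the early-hour filtering can be sketched as follows (a subset of vitals and an illustrative single-patient DataFrame):

```python
import pandas as pd

VITALS = ["HR", "Resp"]   # subset of the vitals columns, for illustration
LAGS = [2, 4, 6]

def add_lagged_vitals(df, vitals=VITALS, lags=LAGS, group_col="pname"):
    """Add lagged values and lag-to-now deltas of vitals within each patient,
    then drop the first max(lags) hours, which cannot have all lagged features."""
    df = df.sort_values([group_col, "hour"]).copy()
    for col in vitals:
        for lag in lags:
            df[f"{col}_lag{lag}"] = df.groupby(group_col)[col].shift(lag)
            df[f"{col}_delta{lag}"] = df[col] - df[f"{col}_lag{lag}"]
    return df[df.groupby(group_col).cumcount() >= max(lags)]

df = pd.DataFrame({"pname": ["p1"] * 8, "hour": range(8),
                   "HR": [80, 81, 82, 84, 86, 90, 95, 99],
                   "Resp": [16, 16, 17, 18, 18, 20, 22, 24]})
lagged = add_lagged_vitals(df)
```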

Predict using lagged vitals

Make X and y with lagged features

Train logistic regression and XGBoost classifiers and look at results

It doesn't look like these lagged features have helped much, except it looks like we're overfitting even more with XGB. Both of these classifiers allow us to inspect feature contributions so that could be an interesting next step to see what the models are learning. It would also be interesting to inspect the types of cases that the models are mis-classifying.

Some Additional Thoughts

References (Links)

Data Dictionary
[1] https://www.uofmhealth.org/health-library/hw2343#:~:text=Test%20Overview,carbon%20dioxide%20from%20the%20blood
[2] https://emupdates.com/spo2-sao2-pao2/
[3] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4267040/
[4] https://www.mayoclinic.org/tests-procedures/blood-urea-nitrogen/about/pac-20384821
[5] https://www.webmd.com/a-to-z-guides/what-is-a-lactic-acid-blood-test
[6] https://medlineplus.gov/lab-tests/magnesium-blood-test/
[7] https://medlineplus.gov/lab-tests/phosphate-in-blood/
[8] https://www.webmd.com/a-to-z-guides/bilirubin-test#2
[9] https://labtestsonline.org/tests/troponin
[10] https://www.mayoclinic.org/tests-procedures/hematocrit/about/pac-20384728
[11] https://medlineplus.gov/lab-tests/partial-thromboplastin-time-ptt-test/
[12] https://medlineplus.gov/ency/article/003650.htm

Insight Sepsis prediction
[13] https://bmjopen.bmj.com/content/8/1/e017833